Abstract

We address the problem of secondary resource allocation when the primary users implement an
Incremental Redundancy Hybrid Automatic Repeat reQuest (IR-HARQ) protocol. The Secondary Users
(SUs) intend to use their knowledge of the IR-HARQ protocol to maximize their long-term throughput
under a constraint of minimal Primary Users (PUs) throughput. The Accumulated Mutual Information
(ACMI), required to model the primary IR-HARQ protocol, is used to define a Constrained Markov
Decision Process (CMDP). The SUs resource allocation is then shown to be a solution of this CMDP.
The allocation problem is then considered as a linear program on an infinite-dimensional space.
Solving the dual of this linear program is similar to solving an unconstrained MDP. A solution is
finally given using the Relative Value Iteration (RVI) algorithm.

I. Introduction 

Cognitive radio has been introduced in order to improve the efficiency of wireless networks
(see e.g. [1], [2] and [3]). This paradigm allows Secondary Users (SUs) to opportunistically
access the bandwidth of licensed Primary Users (PUs), adapting their parameters to limit their
impact on the PUs' performance. Initial works often consider an Opportunistic Spectrum Access
(OSA) model for SUs. In this model, the SUs sense the PUs' bandwidth in order to detect white
spaces in which to communicate. The OSA model targets a zero-interference policy. More recently,
another model called Opportunistic Spectrum Sharing (OSS) has been proposed. The OSS model
allows the SUs to interfere with the PUs as long as the degradation of the PUs' performance
remains below a certain level.

Resource allocation has already been proven to be an efficient tool to address OSS problems.
In [4], the authors propose a power allocation that maximizes the secondary ergodic capacity
under a peak power constraint and an average interference-power constraint. Their model
takes sensing imperfections into account. In [5], the authors propose different
constrained power allocations. In particular, they consider constraints on average
interference power, peak power or outage probability for primary and secondary users. These two
papers consider "secondary centric" constraints: they only consider constraints on
secondary parameters such as secondary peak power or secondary average power. Even
though an outage probability constraint is considered in [5], this constraint is mapped onto a
secondary constraint.

In [6], the author compares the capacity regions of two power allocations coming from two different
optimization problems. The first allocation is the solution to the maximization of the SUs ergodic
capacity under constraints of peak power, average power and average interference power. The
second allocation is the same as the first one except that the constraint on average interference
power is replaced by a constraint on the PUs ergodic capacity loss. The results shown in [6] lead to
the conclusion that using an "ergodic capacity loss" constraint increases the ergodic capacity
region. This idea is then used in [7] to show that it is worth considering the primary protocol
while allocating secondary resources. Indeed, in [7] the secondary user performs active learning in
order to gain some insight into the instantaneous interference it creates on the primary. These
insights are then processed to perform a joint power and rate allocation.

In our case, we consider that the primary does not exploit any Channel State Information
at the Transmitter (CSIT), so it does not probe the channel to adapt its transmission
parameters. This makes the active learning proposed in [7] useless. However, when the primary user
implements an ARQ or Hybrid-ARQ protocol, some learning can still be done by listening to
the feedback required by the protocol. In [8], the authors study a cognitive channel where
the PUs implement an ARQ protocol. They propose an information-theoretic approach
for studying the secondary user capacity. The secondary user tracks the primary feedback,
which allows it to improve its throughput while limiting the primary throughput loss. In
[9], a secondary power allocation has been studied when the primary user is implementing



December 13, 2012 



DRAFT 






Incremental Redundancy Hybrid-ARQ protocol. Unfortunately, the protocol proposed in [9] is
restricted to HARQ with only two transmissions. Moreover, no optimization of rate or power
has been proposed. This work has been extended to IR-HARQ with multiple rounds in [10],
but still no power allocation is done by the secondary user. [11] proposes to use the ARQ protocol
of the primary user in order to manage the interference generated by the secondary user. To
do so, the authors propose a finite Constrained Markov Decision Process (CMDP) (see e.g. [12], [13],
[14], [15]) to describe the state of the primary ARQ protocol. The secondary user communicates
with fixed power and looks for the optimal on/off strategy. The action (on or off) is chosen
in order to maximize the throughput of the secondary system under a constraint on the
primary throughput loss.

The main contributions of this paper are the following. We show that a CMDP model
can also be used to describe the primary IR-HARQ protocol. The main difference with the one
proposed in [11] is that the CMDP model we propose is not finite. Indeed, our model is based on
the evolution of the Accumulated Mutual Information (ACMI) (see e.g. [16], [17]). The ACMI is used
in order to analyse the long-term throughput (average number of received bits per unit of time)
of the IR-HARQ. Since the ACMI is a continuous random variable, we propose a CMDP model
with Borel state and action spaces in order to perform a joint power and rate allocation for
the secondary user. Finally, we derive an algorithm based on Value Iteration to approximate
a solution. The solution obtained in this paper also differs from the one of [18] since, in this
paper, no instantaneous CSIT is required at $\mathrm{Tx}_2$.

This paper is organized as follows. Section II describes the considered network and
the primary and secondary protocols. In Section III, we present how the model can be
associated with a Constrained Markov Decision Process. In Section IV, we show that there exists a
solution to the proposed CMDP. In Section V, we show that the proposed CMDP can be viewed
as a linear program on an infinite-dimensional space. In this section, we also give an algorithm
based on the dual of the linear program that allows us to compute a solution. In Section VI,
we give some simulation results. The conclusion of this work is finally presented in Section VII.

II. System description and performances 

In this section, we describe the network composed of the PUs and the SUs. We then present
the HARQ protocol implemented by the PUs and the power and rate allocation used by the SUs. To
study the two access protocols, we introduce the long-term throughput, a figure of merit
commonly employed to study the performance of HARQ protocols. We finally show
how the OSS model can be expressed as a constrained optimization problem.

A. System Model 

Figure 1. Model of the considered network.

We consider the network illustrated in Figure 1. It is composed of a primary and a secondary
transmitter (respectively denoted $\mathrm{Tx}_1$ and $\mathrm{Tx}_2$) intending to send packets to
their respective receivers $\mathrm{Rx}_1$ and $\mathrm{Rx}_2$. The transmission of a packet by
$\mathrm{Tx}_1$ causes interference on the packet received at $\mathrm{Rx}_2$ and, conversely, the
transmission of a packet by $\mathrm{Tx}_2$ causes interference on the packet received at
$\mathrm{Rx}_1$. Furthermore, we consider that the channel between $\mathrm{Tx}_i$ and
$\mathrm{Rx}_j$, with $i, j \in \{1,2\}$, is a slotted block-fading channel (indexes 1 and 2
respectively designate the PUs and the SUs). We assume stationary and ergodic channels such that
the channel gains $h_{ij}^n$ are constant over the whole duration of slot $n$. For all slots $n$,
we also assume that $h_{ij}^n$ is independent from $h_{i'j'}^n$ for $(i,j) \neq (i',j')$, that
$h_{ij}^n$ is independent from the noise, and that $h_{ij}^n$ is independent and identically
distributed (i.i.d.) such that $\alpha_{ij}^n = |h_{ij}^n|^2$ is an exponential random variable
with mean $c_{ij}$. The signal received at receiver $i \in \{1,2\}$ is then given as follows:

$$y_i^n = h_{1i}^n x_1^n + h_{2i}^n x_2^n + z_i^n, \tag{1}$$

where $z_i^n \in \mathbb{C}^L$ represents the noise, a complex circular white Gaussian random vector
of size $L$ channel uses that we take, without loss of generality, with zero mean and unit variance.
The input signals $x_1^n \in \mathbb{C}^L$ and $x_2^n \in \mathbb{C}^L$ are assumed to be complex
circular white Gaussian random vectors of zero mean and respective transmit powers $p_1$ and $p_2$.
We further assume that the $y_i^n \in \mathbb{C}^L$ are the messages received at $\mathrm{Rx}_i$.
The instantaneous Signal to Interference plus Noise Ratio (SINR) at $\mathrm{Rx}_i$ at slot $n$ is
denoted by $\beta_i^n$ and is defined as follows:

$$\beta_1^n = \frac{p_1 \alpha_{11}^n}{1 + p_2 \alpha_{21}^n}, \qquad \beta_2^n = \frac{p_2 \alpha_{22}^n}{1 + p_1 \alpha_{12}^n}. \tag{2}$$
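To make the channel model concrete, the following minimal Python sketch draws one slot of exponentially distributed power gains and evaluates the two SINRs. It assumes the usual expression in which interference is treated as additional noise of unit variance; the function names and the numerical means are hypothetical and only serve the illustration.

```python
import random

def sample_gains(means, rng):
    """Draw one slot of power gains alpha_ij ~ Exp(mean = means[(i, j)])."""
    # random.expovariate takes the rate (1 / mean), not the mean.
    return {ij: rng.expovariate(1.0 / m) for ij, m in means.items()}

def sinr(gains, p1, p2):
    """SINR at Rx1 and Rx2 for unit noise variance (interference seen as noise)."""
    beta1 = p1 * gains[(1, 1)] / (1.0 + p2 * gains[(2, 1)])
    beta2 = p2 * gains[(2, 2)] / (1.0 + p1 * gains[(1, 2)])
    return beta1, beta2

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical mean gains: strong direct links, weak cross links.
    means = {(1, 1): 1.0, (1, 2): 0.2, (2, 1): 0.2, (2, 2): 1.0}
    gains = sample_gains(means, rng)
    print(sinr(gains, p1=1.0, p2=0.5))
```

Setting `p2 = 0` removes the interference term at $\mathrm{Rx}_1$, which is the situation of a silent secondary user.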

B. The primary protocol

We suppose that the primary user implements an IR-type HARQ protocol (see e.g. [19]).
The transmitter $\mathrm{Tx}_1$ encodes a packet of $b_1$ bits, denoted by $u$, into a codeword
$x$ of length $NL$ channel uses (cu). The codeword $x$ is then divided into $N$ codeblocks
$x_1, x_2, \dots, x_N$ of length $L$ cu. The protocol proceeds as follows: $\mathrm{Tx}_1$ sends the
codeblock $x_1$ into the channel with equivalent rate $r_1 = b_1/L = N r_1'$. If $\mathrm{Rx}_1$
decodes $x_1$ successfully, it broadcasts an acknowledgement (ACK) bit on a feedback channel and
$\mathrm{Tx}_1$ starts the transmission of the next packet in its queue. We consider in this paper
that $\mathrm{Tx}_1$ is backlogged, i.e. $\mathrm{Tx}_1$ always has a packet to transmit in its
buffer. We further assume the one-bit feedback channel to be instantaneous and error-free. If the
decoding of $x_1$ at $\mathrm{Rx}_1$ fails, $\mathrm{Rx}_1$ sends a negative acknowledgement (NACK)
bit on the feedback channel. $\mathrm{Tx}_1$ then transmits $x_2$ in the next slot. When
$\mathrm{Rx}_1$ receives $x_2$, it performs code combining between $x_1$ and $x_2$ and tries again
to decode. This protocol keeps going until either $\mathrm{Rx}_1$ successfully decodes the current
information packet or the $N$ codeblocks have been used and the decoding of $x$ fails. If the
decoding is still unsuccessful after the $N$ transmissions, an outage is declared and we assume
that the packet is discarded.

We finally assume that $\mathrm{Tx}_1$ and $\mathrm{Rx}_1$ are oblivious of the presence of the
secondary users, so that they do not modify their transmission parameters $r_1$ and $p_1$
depending on whether the secondary user is present.

C. The secondary protocol 

We suppose that the secondary users can listen to the feedback of the primary users (see [?])
and use this feedback to adjust an Adaptive Modulation and Coding (AMC)
scheme. Letting $n$ be the index of the current slot, $\mathrm{Tx}_2$ chooses its power and rate
$(p_2^n, r_2^n) \in [0, P_{2M}] \times [0, R_{2M}]$ according to the primary state. $P_{2M}$ is the
maximum peak power allowed for $\mathrm{Tx}_2$ and $R_{2M}$ is the maximum rate of the AMC. We
further assume that there is no Channel State Information at the Transmitter (CSIT) at
$\mathrm{Tx}_2$, i.e. $\mathrm{Tx}_2$ is oblivious of $\alpha_{11}$, $\alpha_{12}$, $\alpha_{21}$
and $\alpha_{22}$.

D. Performances of the primary and secondary protocols

For the rest of this paper, we define the throughput as follows.

Definition 1 ([19], [16] or [20]). The throughput is the average number of information bits
correctly received per unit of time.

For the primary system, the throughput is given by

$$\eta_1(\pi_2) = \lim_{t \to \infty} \frac{1}{t}\, \mathbb{E}\left(\sum_{n=1}^{t} R_1^n\right) \text{ bits/cu}, \tag{3}$$

where $t$ is the time measured in slots and $R_1^n$ is a reward equal to $r_1$ bits/cu if the
current packet is successfully decoded after slot $n$ and to 0 bits/cu otherwise.
$\mathbb{E}\left(\sum_{n=1}^{t} R_1^n\right)$ is the expected number of information bits correctly
received per channel use up to slot $t$. The notation $\eta_1(\pi_2)$ on the left-hand side of
equation (3) emphasizes the dependence of the primary throughput on the secondary power and rate
allocation, denoted by

$$\pi_2 = \{(p_2^n, r_2^n)\}_{n \in \mathbb{N}}. \tag{4}$$

The throughput of the secondary protocol can be computed in a similar way. Introducing the
secondary one-step reward $R_2^n$, equal to $r_2^n$ bits/cu if the secondary packet is correctly
decoded by $\mathrm{Rx}_2$ at the end of slot $n$ and to 0 bits/cu otherwise, leads to the
following expression for the secondary throughput:

$$\eta_2(\pi_2) = \lim_{t \to \infty} \frac{1}{t}\, \mathbb{E}\left(\sum_{n=1}^{t} R_2^n\right). \tag{5}$$

E. Optimization Problem 

The secondary user intends to find a joint power and rate allocation
$\pi_2 = \{(p_2^n, r_2^n)\}_{n \in \mathbb{N}}$ maximizing the secondary throughput
$\eta_2(\pi_2)$ while guaranteeing a target throughput $\eta_{1T}$ for the primary user. The
optimization problem is summarized as follows:

$$\eta_2^* = \sup_{\pi_2} \eta_2(\pi_2) \quad \text{subject to} \quad \eta_1(\pi_2) \ge \eta_{1T}. \tag{6}$$

III. From System Model to Constrained Markov Decision Process Model

In this section, we present how the HARQ protocol of the PUs can be efficiently represented
using a Markov chain. This model then allows us to introduce a Constrained Markov Decision
Process (CMDP). We translate the constrained optimization problem (6) into an equivalent one
in the CMDP framework. We finally give some intrinsic properties of the proposed CMDP that
we will use to obtain results on the solution of (6).

A. Representing HARQ evolution with ACMI 

In order to perform its power and rate allocation, $\mathrm{Tx}_2$ is assumed to know only the
state of the primary IR-HARQ given by the Accumulated Mutual Information (ACMI) after $k_1$
transmissions. The primary ACMI at $\mathrm{Rx}_1$ is denoted by $i_1$. We suppose, without loss of
generality, that the first of those $k_1$ rounds happens in slot 0. The ACMI can be defined as

$$i_1 = \sum_{l=0}^{k_1 - 1} C\left(\beta_1^l\right), \tag{7}$$

where the function $C(x) = \log_2(1 + x)$ is the Shannon capacity of a symmetric Additive White
Gaussian Noise (AWGN) channel. The evolution of the HARQ protocol can be fully tracked
using the parameters $k_1$ and $i_1$. Indeed, the decoding failure event after $k_1 < N_1$
transmissions is given by the event

$$\mathcal{O}_{k_1} = \{ i_1 < r_1 \}.$$

Similarly, the outage event (decoding failure although the $N_1$ codeblocks have been sent) is
defined as

$$\mathcal{O}_{N_1} = \{ i_1 < r_1 \},$$

where $i_1$ is then the ACMI accumulated over the $N_1$ transmissions.

An illustration of the IR-HARQ evolution of the primary user is given in Figure 2. This
representation shows how tracking $k_1$ and $i_1$ helps determine the current state of the
primary HARQ protocol. The HARQ protocol accumulates mutual information until either there
is a successful decoding (this event is illustrated in Figure 2) or there is an outage event (this
event is not illustrated in Figure 2). The state of the IR-HARQ system before slot $n$ is then
fully determined by the couple $(k_1^n, i_1^n)$, where $k_1^n \in \{0, 1, \dots, N_1\}$ represents
the number of transmissions performed before slot $n$. $k_1^n$ is set to 0 after a successful
decoding and to $N_1$ after an outage. $i_1^n$ represents the ACMI at $\mathrm{Rx}_1$ before slot
$n$, with the convention that $i_1^n = 0$ after a successful decoding or an outage.

Figure 2. Temporal evolution of the mutual information for the primary IR-HARQ protocol.

Using the event $\mathcal{O}(n) = \{ i_1^n + C(\beta_1^n) < r_1 \}$, the evolution from
$(k_1^n, i_1^n)$ to $(k_1^{n+1}, i_1^{n+1})$ can then be given as follows:

$$(k_1^{n+1}, i_1^{n+1}) = \begin{cases}
\left(k_1^n + 1,\; i_1^n + C(\beta_1^n)\right) & \text{if } \mathcal{O}(n) \text{ and } k_1^n < N_1 - 1, \\
(N_1, 0) & \text{if } \mathcal{O}(n) \text{ and } k_1^n = N_1 - 1, \\
\left(1,\; C(\beta_1^n)\right) & \text{if } \mathcal{O}(n) \text{ and } k_1^n = N_1, \\
(0, 0) & \text{if } \overline{\mathcal{O}}(n),
\end{cases} \tag{8}$$

where $\overline{\mathcal{O}}(n)$ denotes the event 'not $\mathcal{O}(n)$'.
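The state recursion can be written as a small function. The sketch below is one possible Python transcription of the four cases (success, failure with rounds left, failure on the last round, and restart after an outage); the names `harq_step` and `n_max` are hypothetical, and the sketch assumes at least two transmissions per packet.

```python
import math

def capacity(x):
    """Shannon capacity C(x) = log2(1 + x)."""
    return math.log2(1.0 + x)

def harq_step(k, i, beta1, r1, n_max):
    """One slot of the (k, i) recursion of the primary IR-HARQ.

    k     : number of transmissions already performed (n_max after an outage)
    i     : accumulated mutual information before the slot
    beta1 : primary SINR during the slot
    r1    : primary rate; n_max : maximum number of rounds (assumed >= 2)
    """
    if k == n_max:              # previous packet was dropped: a new packet starts
        k, i = 0, 0.0
    acmi = i + capacity(beta1)
    if acmi >= r1:              # successful decoding
        return (0, 0.0)
    if k == n_max - 1:          # last allowed round failed: outage
        return (n_max, 0.0)
    return (k + 1, acmi)        # failure, keep accumulating

if __name__ == "__main__":
    state = (0, 0.0)
    for beta in (1.0, 0.5, 3.0):    # arbitrary SINR values for illustration
        state = harq_step(*state, beta1=beta, r1=2.0, n_max=3)
        print(state)
```

Treating the state after an outage as the start of a fresh packet reproduces the case $k_1^n = N_1$ of the recursion without a separate branch.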

Because of equation (2), the secondary power and rate allocation $(p_2^n, r_2^n)$ obviously has an
impact on the evolution of $k_1^n$ and $i_1^n$, and therefore on the performance of the primary
system. Since there is no analytical expression of the primary throughput valid for all policies
$(p_2^n, r_2^n)$, this problem cannot be solved using classical tools from optimization theory.
However, Constrained Markov Decision Processes (CMDP) seem to be appropriate for solving this
problem. Since the space in which the system evolves (the space of all $(k_1^n, i_1^n)$) is
neither discrete nor continuous, we cannot describe its random evolution with either discrete or
continuous random variables. To circumvent this difficulty, we use the theory of CMDP on Borel
spaces [21] to model and solve the optimization problem (6).

B. Constrained Markov Decision Process 

The CMDP definition (see e.g. [12]-[15]) adapted to our problem is a tuple
$(\mathbb{S}, \mathbb{A}, \mathbb{W}, Q, R_1, R_2, \eta_{1T})$ where each component is defined as
follows.

• The state space: $\mathbb{S} = \{0, 1, \dots, N\} \times [0, r_1]$ is the set of all possible
states of $\mathrm{Tx}_1$. At slot $n$, the state $s_n \in \mathbb{S}$, characterized by
$s_n = (k_1^n, i_1^n)$, is observed by $\mathrm{Tx}_2$.

• The action space: $\mathbb{A} = [0, P_{2M}] \times [0, R_{2M}]$ is the space of all possible
actions available for $\mathrm{Tx}_2$. At slot $n$, $\mathrm{Tx}_2$ transmits a new packet using a
power/rate couple given by $a_n = (p_2^n, r_2^n)$. For the rest of this paper, the set of all
state/action couples is denoted by $\mathbb{K} = \mathbb{S} \times \mathbb{A}$.

• The disturbance space: at slot $n$, the gains $\alpha_{11}^n$, $\alpha_{12}^n$, $\alpha_{21}^n$
and $\alpha_{22}^n$ are unknown to $\mathrm{Tx}_2$ and are considered as disturbances. We consider
them as belonging to the disturbance space $\mathbb{W} = ([0, +\infty[)^4$.

• The system function $g(\cdot)$ is a deterministic function from
$\mathbb{S} \times \mathbb{A} \times \mathbb{W}$ to $\mathbb{S}$ which describes the evolution of
the system from state $s_n \in \mathbb{S}$ at slot $n$ to state $s_{n+1} \in \mathbb{S}$ at slot
$n + 1$ when the action $a_n$ is performed and the disturbance is $w_n$. $g(\cdot)$ is defined as
follows:

$$g(s_n, a_n, w_n) = (k_1^{n+1}, i_1^{n+1}) = s_{n+1}, \tag{9}$$

where $k_1^{n+1}$ and $i_1^{n+1}$ are given by (8), in which $w_n$ is used to compute $\beta_1^n$.

• The transition law: the evolution of the system is statistically represented by the transition
law denoted by $Q$ and illustrated in Figure 3. For a given measurable subset
$B \in \mathcal{B}(\mathbb{S})$ and a couple $(s, a) \in \mathbb{K}$, $Q$ is defined by

$$Q(B|s, a) = \mathbb{P}\left(s_{n+1} = g(s_n, a_n, w_n) \in B \mid s_n = s, a_n = a\right). \tag{10}$$

The expression of $Q(B|s, a)$ is given in Appendix A-A.

Figure 3. Illustration of the transition law $Q(\cdot|s, a)$. The state $s_0 = (0, 0)$ represents
a successful decoding of the packet of the PUs.

• The rewards: at slot $n$, the primary instantaneous reward $R_1$ is defined in Section II but is
rewritten as a function of $s_n$ and $a_n$ as follows:

$$R_1(s_n, a_n) = r_1 \mathbb{1}_{\{k = 0\}}(s_n), \tag{11}$$

where $\mathbb{1}_A(s)$ is the function equal to 1 if $s \in A$ and to 0 otherwise.
Similarly, $R_2$ can be written as a function of $(s_n, a_n) \in \mathbb{K}$ and of the
disturbance $w_n$ as

$$R_2(s_n, a_n, w_n) = r_2^n \mathbb{1}_{\{r_2^n \le \log_2(1 + \beta_2^n)\}}(s_n, a_n, w_n). \tag{12}$$

We remark here that $R_2$ is independent from $s_n$; however, it depends on the action $a_n$ and on
the disturbance $w_n$. As proposed in [22], we then introduce the average reward as

$$\bar{R}_2(a_n) = r_2^n\, \mathbb{P}\left(r_2^n \le \log_2(1 + \beta_2^n) \mid p_2^n\right). \tag{13}$$
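The average secondary reward can be estimated numerically. The following Python sketch uses a plain Monte Carlo average over exponentially distributed gains; it assumes the secondary SINR is $p_2 \alpha_{22} / (1 + p_1 \alpha_{12})$, and the function name, the mean gains and the sample size are hypothetical choices for illustration only.

```python
import math
import random

def avg_secondary_reward(p2, r2, p1, mean22, mean12, n_samples=20000, seed=0):
    """Monte Carlo estimate of r2 * P(r2 <= log2(1 + beta2) | p2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        a22 = rng.expovariate(1.0 / mean22)   # direct secondary link gain
        a12 = rng.expovariate(1.0 / mean12)   # interference from Tx1 at Rx2
        beta2 = p2 * a22 / (1.0 + p1 * a12)
        if r2 <= math.log2(1.0 + beta2):      # packet decodable in this slot
            hits += 1
    return r2 * hits / n_samples

if __name__ == "__main__":
    for rate in (0.5, 1.0, 2.0):
        print(rate, avg_secondary_reward(p2=1.0, r2=rate, p1=1.0,
                                         mean22=1.0, mean12=0.2))
```

The estimate exhibits the usual trade-off: a larger rate pays more per success but succeeds less often, and a zero power yields a zero reward for any positive rate.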

C. Policies 

At slot $n$, we suppose that $\mathrm{Tx}_2$ can store every visited state and every taken action
in a vector called "history", defined as $h_n = (s_0, a_0, s_1, a_1, \dots, s_{n-1}, a_{n-1}, s_n)$.
The space of all possible histories up to time $n$ is recursively defined as
$\mathbb{H}_0 = \mathbb{S}$ and $\mathbb{H}_n = \mathbb{K}^n \times \mathbb{S}$.

Suppose now that, at slot $n$, $\mathrm{Tx}_2$ has stored the history $h_n$. According to $h_n$,
$\mathrm{Tx}_2$ randomly chooses an action within the set $\mathbb{A}$. This choice can be written
more formally using the conditional measure $\pi_2^n(\cdot|h_n)$. For any $A \subset \mathbb{A}$,
the probability that $\mathrm{Tx}_2$ chooses an action from $A$ is $\pi_2^n(A|h_n)$. Since
$\mathrm{Tx}_2$ must choose an action in every slot $n$, $\pi_2^n$ verifies
$\pi_2^n(\mathbb{A}|h_n) = 1$. A policy is then defined as a sequence
$\pi_2 = \{\pi_2^n\}_{n \in \mathbb{N}}$. The set of all possible policies is denoted by $\Pi$.

A policy is said to be randomized stationary if there exists a probability measure $\varphi$ such
that, for all $n$, $\pi_2^n(\cdot|h_n) = \varphi(\cdot|s_n)$. The set of all randomized stationary
policies is denoted by $\Pi_{RS}$.

A policy is said to be deterministic stationary if there exists a deterministic function $\zeta$
such that, for all $n$, $a_n = \zeta(s_n)$. Using measure notation, this means that, for all $n$,
$\pi_2^n(\cdot|h_n) = \delta_{\zeta(s_n)}(\cdot)$, where $\delta_a(B)$ is the Dirac measure, that
is, the measure equal to 1 if $a \in B$ and to 0 otherwise. The set of all deterministic
stationary policies is denoted by $\Pi_{DS}$. Note that we have the following inclusions:
$\Pi_{DS} \subset \Pi_{RS} \subset \Pi$.

Suppose that the initial state $s_0$ is drawn according to some probability distribution $\nu_0$.
We denote by $\mathbb{E}_{\nu_0}^{\pi_2}$ the expectation taken over the processes $s_n$ and $a_n$
with respect to the initial distribution $\nu_0$ and the policy $\pi_2$.

In light of the preceding definitions, the limits in (3) and (5) are slightly modified as follows:

$$\eta_2(\pi_2, \nu_0) = \liminf_{t \to \infty} \frac{1}{t}\, \mathbb{E}_{\nu_0}^{\pi_2}\left(\sum_{n=1}^{t} R_2(s_n, a_n)\right) \text{ bits/cu} \tag{14}$$

and

$$\eta_1(\pi_2, \nu_0) = \liminf_{t \to \infty} \frac{1}{t}\, \mathbb{E}_{\nu_0}^{\pi_2}\left(\sum_{n=1}^{t} R_1(s_n, a_n)\right) \text{ bits/cu}. \tag{15}$$

Let $\nu_0$ be given; the problem defined by (6) is modified as follows:

$$\eta_2^* = \sup_{\pi_2 \in \Pi} \eta_2(\pi_2, \nu_0) \quad \text{subject to} \quad \eta_1(\pi_2, \nu_0) \ge \eta_{1T}. \tag{16}$$

For the rest of this paper, $\Omega$ stands for the set of all admissible policies, defined as
$\Omega = \{\pi \in \Pi \mid \eta_1(\pi, \nu_0) \ge \eta_{1T}\}$.
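The two long-run averages can be approximated by simulating the chain under a given stationary policy. The Python sketch below does so for a deterministic stationary policy passed as a function of the state; it is only an illustration under assumed parameters (exponential gains with hypothetical means, interference treated as noise, at least two HARQ rounds), not the evaluation method of the paper.

```python
import math
import random

def simulate(policy, r1, p1, n_max, means, n_slots=50000, seed=0):
    """Empirical primary and secondary throughputs (bits/cu) under `policy`.

    policy(k, i) -> (p2, r2) is a deterministic stationary policy.
    means[(i, j)] is the mean power gain between Tx_i and Rx_j.
    """
    rng = random.Random(seed)
    k, i = 0, 0.0
    bits1 = bits2 = 0.0
    for _ in range(n_slots):
        p2, r2 = policy(k, i)
        a = {ij: rng.expovariate(1.0 / m) for ij, m in means.items()}
        beta1 = p1 * a[(1, 1)] / (1.0 + p2 * a[(2, 1)])
        beta2 = p2 * a[(2, 2)] / (1.0 + p1 * a[(1, 2)])
        if r2 <= math.log2(1.0 + beta2):   # secondary packet decoded
            bits2 += r2
        if k == n_max:                     # restart after an outage
            k, i = 0, 0.0
        acmi = i + math.log2(1.0 + beta1)
        if acmi >= r1:                     # primary packet decoded
            bits1 += r1
            k, i = 0, 0.0
        elif k == n_max - 1:               # outage on the last round
            k, i = n_max, 0.0
        else:
            k, i = k + 1, acmi
    return bits1 / n_slots, bits2 / n_slots

if __name__ == "__main__":
    means = {(1, 1): 1.0, (1, 2): 0.2, (2, 1): 0.2, (2, 2): 1.0}  # hypothetical
    silent = lambda k, i: (0.0, 0.0)      # secondary never transmits
    constant = lambda k, i: (0.5, 1.0)    # fixed power and rate in every state
    for name, pol in (("silent", silent), ("constant", constant)):
        print(name, simulate(pol, r1=2.0, p1=1.0, n_max=3, means=means))
```

A policy is admissible in the sense of the set $\Omega$ when the first returned value stays above the target $\eta_{1T}$, up to simulation error.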

IV. Ergodicity results, Consistency and Existence of a solution 

In this section, we prove that every randomized stationary policy induces an ergodic
Markov chain on $\mathbb{S}$. This result is then used to prove that every policy in $\Pi$ is
outperformed by a policy in $\Pi_{RS}$, which allows us to restrict the set of admissible policies
$\Omega$ to the set of randomized stationary admissible policies $\Omega \cap \Pi_{RS}$. We
finally provide a condition on $\eta_{1T}$ that guarantees that the optimization problem (16) is
consistent. In other words, we give a condition on $\eta_{1T}$ for $\Omega$ to be non-empty. We
then prove that, under this condition, the optimization problem (16) is solvable, that is, there
exists $\pi_2 \in \Pi$ such that $\eta_2(\pi_2) = \eta_2^*$.

A. Ergodicity results 

For every randomized stationary policy $\varphi$, the state $s_n$ evolves according
to a Markov chain. Indeed, the probability that, at time $n + 1$, the system state belongs to a set
$B \in \mathcal{B}(\mathbb{S})$, knowing that at time $n$ the system was in state $s$ and that the
randomized stationary policy $\varphi$ is used, is given by

$$Q_\varphi(B|s) = \int_{\mathbb{A}} Q(B|s, a)\, \varphi(da|s). \tag{17}$$

In a similar way, for all $t \in \mathbb{N}$, $B \in \mathcal{B}(\mathbb{S})$, $s \in \mathbb{S}$
and $\varphi \in \Pi_{RS}$, we introduce the $t$-step transition probability measure, which is the
probability that $s_{n+t} \in B$ at time $n + t$ knowing that at time $n$ the system was in state
$s$ and that $\varphi$ is applied $t$ times, as follows:

$$Q_\varphi^0(B|s) = \mathbb{1}_B(s), \qquad Q_\varphi^t(B|s) = \int_{\mathbb{S}} Q_\varphi^{t-1}(B|s')\, Q_\varphi(ds'|s). \tag{18}$$

We now show the following theorem.

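Theorem 1. Let $\varphi \in \Pi_{RS}$ be given. The Markov chain induced by $\varphi$ is ergodic,
that is, there exists a unique probability measure $p_\varphi$ verifying

$$p_\varphi(B) = \int_{\mathbb{S}} Q_\varphi(B|s)\, p_\varphi(ds), \quad \forall B \in \mathcal{B}(\mathbb{S}). \tag{19}$$

Proof: The proof of this theorem is a direct consequence of Lemma 3.3 of [21]. This
lemma, adapted to our case, is given here for ease of presentation.

Lemma 1 ([21], Lemma 3.3, p. 57). If there exists a measure $\mu$ on $\mathbb{S}$ such that

$$\mu(\mathbb{S}) > 0, \tag{20}$$
$$Q(B|k) \ge \mu(B), \quad \forall k \in \mathbb{K},\ \forall B \in \mathcal{B}(\mathbb{S}), \tag{21}$$

then for every $\varphi \in \Pi_{RS}$ there exists a probability measure $p_\varphi$ on
$\mathbb{S}$ such that

$$\sup_{s \in \mathbb{S}} \left\| Q_\varphi^t(\cdot|s) - p_\varphi(\cdot) \right\|_{TV} \xrightarrow[t \to \infty]{} 0. \tag{22}$$

We further have that $p_\varphi$ verifies the following property:

$$p_\varphi(B) = \int_{\mathbb{S}} Q_\varphi(B|s)\, p_\varphi(ds), \quad \forall B \in \mathcal{B}(\mathbb{S}).$$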

It then remains to prove that there exists a measure $\mu$ satisfying equations (20)
and (21). To do so, remark that the state $s_0 = (0, 0)$, which physically represents a successful
decoding at $\mathrm{Rx}_1$, is accessible from every other state: from every state $s$, there is
a non-zero probability of a successful decoding at $\mathrm{Rx}_1$. Consider a state
$s = (k_1, i_1) \in \mathbb{S}$ and an action $a = (p_2, r_2) \in \mathbb{A}$; we have

$$Q(\{s_0\}|s, a) = \mathbb{P}\left(i_1 + C\left(\frac{p_1 \alpha_{11}}{1 + p_2 \alpha_{21}}\right) \ge r_1 \,\Big|\, i_1, p_2\right). \tag{23}$$

Let a set $B \in \mathcal{B}(\mathbb{S})$ be given. For every $k = (s, a) \in \mathbb{K}$, we have

$$\begin{aligned}
Q(B|k) &= Q(B \cap \{s_0\}^c | k) + Q(B \cap \{s_0\} | k) \\
&\ge Q(B \cap \{s_0\} | k) \\
&= \mathbb{P}\left(i_1 + C\left(\frac{p_1 \alpha_{11}}{1 + p_2 \alpha_{21}}\right) \ge r_1\right) \mathbb{1}_B(s_0).
\end{aligned} \tag{24}$$

For all $B \in \mathcal{B}(\mathbb{S})$, consider
$\mu(B) = \mathbb{P}\left(C\left(\frac{p_1 \alpha_{11}}{1 + P_{2M} \alpha_{21}}\right) \ge r_1\right) \mathbb{1}_B(s_0)$.
We finally have that, for every $k \in \mathbb{K}$ and every $B \in \mathcal{B}(\mathbb{S})$,
$Q(B|k) \ge \mu(B)$; furthermore,
$\mu(\mathbb{S}) = \mathbb{P}\left(C\left(\frac{p_1 \alpha_{11}}{1 + P_{2M} \alpha_{21}}\right) \ge r_1\right) > 0$,
where the last inequality comes from the fact that $P_{2M}$ is bounded. Equations (20) and (21) are
both verified, which proves that every randomized stationary policy induces an ergodic Markov
chain on $\mathbb{S}$. ■
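The role of the minorization condition can be illustrated on a finite toy chain. In the Python sketch below, every row of a hypothetical 3-state transition matrix puts at least a mass `eps` on state 0 (which plays the part of $s_0$); the distributions started from two different states are then seen to merge, which is the finite-state analogue of the convergence stated in Lemma 1. The matrix entries are arbitrary and are not derived from the model of this paper.

```python
def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]

# Toy 3-state chain; state 0 stands for s_0 (successful decoding).
P = [
    [0.5, 0.5, 0.0],
    [0.3, 0.0, 0.7],
    [0.4, 0.0, 0.6],
]

# Minorization constant: every state reaches state 0 with probability >= eps.
eps = min(row[0] for row in P)

def run(dist, n_steps):
    """Propagate an initial distribution for n_steps slots."""
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist

d_a = run([1.0, 0.0, 0.0], 60)   # start in state 0
d_b = run([0.0, 0.0, 1.0], 60)   # start in state 2
gap = sum(abs(x - y) for x, y in zip(d_a, d_b))

if __name__ == "__main__":
    print("eps =", eps)
    print("stationary distribution ~", d_a)
    print("L1 gap between the two runs =", gap)
```

Because of the common mass on state 0, the distance between the two runs shrinks at least by a factor `1 - eps` per step, so both converge to the same invariant distribution regardless of the initial state.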
These ergodicity results also lead to the fact that, for every randomized stationary policy, the
primary and secondary throughputs can be written as follows:

$$\eta_i(\nu_0, \varphi) = \int_{\mathbb{S}} R_i(s, \varphi)\, p_\varphi(ds), \quad i \in \{1, 2\}. \tag{25}$$

Remark 1. For every $\varphi \in \Pi_{RS}$ and for $i = 1$ and $i = 2$, the right-hand side of
equation (25) does not depend on the initial distribution $\nu_0$. In the sequel, for every
$\varphi \in \Pi_{RS}$ we will then write

$$\eta_i(\varphi) = \eta_i(\nu_0, \varphi) \tag{26}$$
$$\phantom{\eta_i(\varphi)} = \int_{\mathbb{S}} R_i(s, \varphi)\, p_\varphi(ds). \tag{27}$$



B. Domination of the randomized stationary policies 

The ergodicity property developed in the previous section, especially the result given by
equation (23), has been used in [15] in order to show the following lemma.

Lemma 2 ([15], Lemma 3.5, p. 448). If $\Omega$ is non-empty, then the following holds: for every
initial distribution $\nu_0$ and for every $\pi \in \Omega$, there exists $\varphi \in \Pi_{RS}$
such that
• $\varphi \in \Omega$ and
• $\eta_2(\varphi) \ge \eta_2(\pi, \nu_0)$.

Proof: The proof of Lemma 2 is similar to the one proposed in [23] for the general
case of CMDP. We therefore only emphasize the main steps. The proof directly builds a randomized
stationary policy. Define the occupation measure as follows:

$$m_n(\Gamma) = \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{P}_{\nu_0}^{\pi}\left((s_t, a_t) \in \Gamma\right), \quad \Gamma \in \mathcal{B}(\mathbb{S} \times \mathbb{A}),$$

and express $\eta_1$ and $\eta_2$ as

$$\eta_i(\pi, \nu_0) = \liminf_{n \to \infty} \int R_i(s, a)\, m_n(ds, da).$$

Using the compactness of the space $\mathbb{S}$ in the theorem of Prohorov given in Appendix B-A,
there exist a measure $m$ and a subsequence $n_j$ such that $m_{n_j} \to m$. Using the continuity
of $R_1$ and $R_2$ gives the following result:

$$\eta_i(\pi, \nu_0) = \lim_{j \to \infty} \int R_i(s, a)\, m_{n_j}(ds, da) = \int R_i\, dm.$$

The authors of [23] then show that $m$ can be disintegrated as $m = \varphi\, p_\varphi$ (see
Appendix B-B for the disintegration result), where $p_\varphi$ verifies equation (19). This
justifies the fact that $m$ defines the required randomized stationary policy. ■
Lemma 2 means that for every policy $\pi \in \Omega$, there exists a randomized stationary policy
$\varphi \in \Pi_{RS} \cap \Omega$ that performs at least as well. We then say that $\Pi_{RS}$
dominates $\Pi$. The optimization problem (16) can then be rewritten in $\Pi_{RS}$ as follows:

$$\eta_2^* = \sup_{\varphi \in \Pi_{RS} \cap \Omega} \eta_2(\varphi). \tag{28}$$

For the rest of this paper, we denote the set of feasible randomized stationary policies by
$\Omega_{RS} = \Pi_{RS} \cap \Omega$.



C. Consistency of the optimization problem (28)

Let $\eta_{10}$ be the throughput of the primary user when the secondary user uses the allocation
$\zeta_0$ defined for all $s \in \mathbb{S}$ as $\zeta_0(s) = (0, 0)$. We give hereafter a
consistency condition for (28).

Theorem 2 (Consistency of the optimization problem (28)).
The constrained problem (28) is consistent if and only if $\eta_{1T} \le \eta_{10}$. By
definition, the constrained problem (28) is said to be consistent if and only if the set
$\Omega_{RS}$ is non-empty.

We now prove a lemma that will be used in the proof of Theorem 2.

Lemma 3. Let $\varphi_1 \in \Pi_{RS}$ and $\varphi_2 \in \Pi_{RS}$ be two given policies. Suppose
that for every $s \in \mathbb{S}$, there exist $K_1(s) \subset \mathbb{A}$ and
$K_2(s) \subset \mathbb{A}$ such that

$$\varphi_1(K_1(s)|s) = 1, \tag{29}$$
$$\varphi_2(K_2(s)|s) = 1, \tag{30}$$
$$\forall a_1 = (p_2^1, r_2^1) \in K_1(s),\ \forall a_2 = (p_2^2, r_2^2) \in K_2(s), \quad a_1 \succeq a_2, \tag{31}$$

where we write $a_1 \succeq a_2$ if either '$p_2^1 > p_2^2$' or '$p_2^1 = p_2^2$ and
$r_2^1 \ge r_2^2$'. Under these conditions, we have the following inequality:
$\eta_1(\varphi_1) \le \eta_1(\varphi_2)$.

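Proof: Let $\varphi_1$ and $\varphi_2$ verify the hypotheses of Lemma 3. Since both policies belong
to $\Pi_{RS}$, their primary throughputs can be written as
$\eta_1(\varphi_1) = \int_{\mathbb{S}} R_1(s, \varphi_1)\, p_{\varphi_1}(ds)$ and
$\eta_1(\varphi_2) = \int_{\mathbb{S}} R_1(s, \varphi_2)\, p_{\varphi_2}(ds)$, which, using the
definition of $R_1(s, a)$ given in equation (11), can be written as
$\eta_1(\varphi_1) = r_1\, p_{\varphi_1}(\{s_0\})$ and
$\eta_1(\varphi_2) = r_1\, p_{\varphi_2}(\{s_0\})$. Also, for $i \in \{1, 2\}$,
$p_{\varphi_i}(\{s_0\})$ has to verify equation (19), which in this case can be written as follows:

$$p_{\varphi_i}(\{s_0\}) = \int_{\mathbb{S}} Q_{\varphi_i}(\{s_0\}|s)\, p_{\varphi_i}(ds).$$

For every $s \in \mathbb{S}$, $Q_{\varphi_i}(\{s_0\}|s)$ is evaluated as follows:

$$Q_{\varphi_i}(\{s_0\}|s) = \int_{\mathbb{A}} \mathbb{P}\left(i_1 + C\left(\frac{p_1 \alpha_{11}}{1 + p_2 \alpha_{21}}\right) \ge r_1 \,\Big|\, s, a\right) \varphi_i(da|s).$$

Due to equation (31), for all $s \in \mathbb{S}$, $a_1 = (p_2^1, r_2^1) \in K_1(s)$ and
$a_2 = (p_2^2, r_2^2) \in K_2(s)$, we have $p_2^1 \ge p_2^2$. This implies the following inequality:

$$\mathbb{P}\left(i_1 + C\left(\frac{p_1 \alpha_{11}}{1 + p_2^1 \alpha_{21}}\right) \ge r_1 \,\Big|\, s, a_1\right) \le \mathbb{P}\left(i_1 + C\left(\frac{p_1 \alpha_{11}}{1 + p_2^2 \alpha_{21}}\right) \ge r_1 \,\Big|\, s, a_2\right).$$

We finally have that, for all $s \in \mathbb{S}$,
$Q_{\varphi_1}(\{s_0\}|s) \le Q_{\varphi_2}(\{s_0\}|s)$, which proves that
$\eta_1(\varphi_1) \le \eta_1(\varphi_2)$. ■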
Proof of Theorem 2: The proof of the direct part, that is
"$\eta_{1T} \le \eta_{10} \Rightarrow \Omega_{RS} \neq \emptyset$", is obvious. Indeed,
$\eta_{1T} \le \eta_{10}$ implies that $\zeta_0 \in \Omega_{RS}$, which in turn implies that
$\Omega_{RS} \neq \emptyset$.

Let us now show the converse part, that is,
"$\Omega_{RS} \neq \emptyset \Rightarrow \eta_{1T} \le \eta_{10}$". Suppose that
$\Omega_{RS} \neq \emptyset$. This implies that there exists $\varphi \in \Omega_{RS}$. Due to the
definition of a randomized stationary policy, $\varphi$ must verify that, for every
$s \in \mathbb{S}$, $\varphi(\mathbb{A}|s) = 1$. Applying Lemma 3 with $\varphi_1 = \varphi$,
$\varphi_2 = \zeta_0$ and, for every $s \in \mathbb{S}$, $K_1(s) = \mathbb{A}$ and
$K_2(s) = \{(0, 0)\}$ leads to the conclusion that $\eta_1(\varphi) \le \eta_1(\zeta_0)$.
Considering now that $\eta_{10} = \eta_1(\zeta_0)$ and that $\varphi \in \Omega_{RS}$, we have

$$\eta_{1T} \le \eta_1(\varphi) \le \eta_{10},$$

which concludes this proof. ■
Theorem 2 states that the set of admissible policies is non-empty if and only if the PUs
throughput constraint is below the throughput of the primary system in the absence of SUs. This
conclusion means that no policy in $\Pi$ can improve the throughput of the PUs, which is logical
for our model.

D. Solvability of the optimization problem 

In this section, we show that there exists a solution to the optimization problem (28). That is,
we want to show the following theorem.

Theorem 3 (Solvability). If $\eta_{1T} \le \eta_{10}$, there exists $\varphi \in \Pi_{RS}$ such
that $\eta_2(\varphi) = \eta_2^*$ and $\eta_1(\varphi) \ge \eta_{1T}$.

Proof: The proof of this theorem is again similar to the one given in [13], considering the
same modifications as the ones made in the proof of Lemma 2. ■
In this section, we have shown that the optimization problem (16) can be reduced from $\Pi$ to
$\Pi_{RS}$ without loss of optimality. We have then given a condition on the primary throughput
constraint that guarantees the consistency of the optimization problem (16). Under this consistency
condition, we have finally shown that there exists an optimal randomized strategy. In the next
section, we show how a linear programming approach can be used to give an algorithm that
computes an optimal policy.

V. The linear programming approach 

In this section, we show that the optimization problem (16) can be viewed as a linear
program on an infinite-dimensional space. We then give the dual formulation of this linear
program and, based on this dual formulation, we propose an algorithm to build an optimal
policy.

A. Linear Programming Formulation

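In Section IV, we have shown that a probability measure $m$ on $\mathbb{S} \times \mathbb{A}$
defines a randomized stationary policy $\varphi$ if $m = \varphi\, \hat{m}$ and $\hat{m}$ verifies
equation (19). This can be translated as follows: $m$ is a probability measure if and only if

$$\int_{\mathbb{S} \times \mathbb{A}} m(d(s, a)) = 1, \tag{32}$$

and $m$ defines some $\varphi$ if and only if

$$L_0 m(B) = \hat{m}(B) - \int_{\mathbb{S} \times \mathbb{A}} Q(B|s, a)\, m(d(s, a)) = 0, \quad \forall B \in \mathcal{B}(\mathbb{S}). \tag{33}$$

We have also shown that the primary and secondary throughputs can be written as follows:

$$\eta_i(\varphi) = \int_{\mathbb{S} \times \mathbb{A}} R_i(s, a)\, m(d(s, a)), \quad i \in \{1, 2\}. \tag{34}$$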

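Consider the following two dual pairs of vector spaces $(\mathcal{M}(\mathbb{K}), \mathcal{F}(\mathbb{K}))$ and $(\mathcal{M}(\mathbb{S}), \mathcal{F}(\mathbb{S}))$.
$\mathcal{M}(\mathbb{K})$ (resp. $\mathcal{M}(\mathbb{S})$) is the space of signed measures on the space $\mathbb{K}$ (resp. on the space $\mathbb{S}$).
$\mathcal{F}(\mathbb{K})$ (resp. $\mathcal{F}(\mathbb{S})$) is the space of bounded measurable functions on $\mathbb{K}$ (resp. on the space $\mathbb{S}$).
Let $\langle \cdot, \cdot \rangle_{\mathbb{K}}$ be the bilinear form for the dual pair $(\mathcal{M}(\mathbb{K}), \mathcal{F}(\mathbb{K}))$ defined as follows:

$$\langle m, v \rangle_{\mathbb{K}} = \int_{\mathbb{K}} v(s,a)\, m(d(s,a)). \tag{35}$$

In a similar way, we introduce $\langle \cdot, \cdot \rangle_{\mathbb{S}}$ for the dual pair $(\mathcal{M}(\mathbb{S}), \mathcal{F}(\mathbb{S}))$. The optimization problem
(28) can then be rewritten as

$$\begin{aligned} \eta_2^* = \sup_{m \in \mathcal{M}(\mathbb{K})^+}\ & \langle m, R_2 \rangle_{\mathbb{K}}, \\ \text{s.t.}\ & L m(B) = 0, \quad \forall B \in \mathcal{B}(\mathbb{S}), \\ & \langle m, 1 \rangle_{\mathbb{K}} = 1, \end{aligned} \tag{36}$$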

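where $\mathcal{M}(\mathbb{K})^+$ is the cone of positive measures on $\mathbb{K}$. Since $L$ is a linear map from $\mathcal{M}(\mathbb{K})$ to
$\mathcal{M}(\mathbb{S})$, its adjoint $L^*$ is the map such that, for every $v \in \mathcal{F}(\mathbb{S})$,

$$\langle L m, v \rangle_{\mathbb{S}} = \langle m, L^* v \rangle_{\mathbb{K}}, \tag{37}$$

and is given by the following equation:

$$(L^* v)(s,a) = v(s) - \int_{\mathbb{S}} v(y)\, Q(dy|s,a), \quad \forall (s,a) \in \mathbb{K},\ v \in \mathcal{F}(\mathbb{S}). \tag{38}$$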
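As a sanity check of the adjoint relation (37) and of the expression (38), the following sketch verifies the identity numerically on a small finite model. It assumes finite state and action sets, so that measures become arrays and integrals become sums; the sizes, the random kernel and the helper names `L_op` and `L_adj` are illustrative choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3                                   # hypothetical numbers of states and actions

# Random transition kernel Q[s, a, s'] (each row sums to 1) and random measure m on S x A.
Q = rng.random((S, A, S))
Q /= Q.sum(axis=2, keepdims=True)
m = rng.random((S, A))
m /= m.sum()
v = rng.standard_normal(S)                    # arbitrary bounded function on the state set

def L_op(m, Q):
    """Finite analogue of (33): (L m)(s') = m_hat(s') - sum_{s,a} Q(s'|s,a) m(s,a)."""
    m_hat = m.sum(axis=1)                     # marginal of m on the state set
    return m_hat - np.einsum("sa,sab->b", m, Q)

def L_adj(v, Q):
    """Finite analogue of (38): (L* v)(s,a) = v(s) - sum_y v(y) Q(y|s,a)."""
    return v[:, None] - Q @ v

lhs = L_op(m, Q) @ v                          # <L m, v> on the state set
rhs = np.sum(m * L_adj(v, Q))                 # <m, L* v> on the state-action set
print(lhs, rhs)
```

Both pairings agree up to floating-point error, which is the finite-dimensional counterpart of (37).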


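Since $v \in \mathcal{F}(\mathbb{S})$ and $Q$ is strongly continuous, we have that $L^* v \in \mathcal{F}(\mathbb{K})$. Hence
$L^*$ is a linear map from $\mathcal{F}(\mathbb{S})$ to $\mathcal{F}(\mathbb{K})$, which guarantees that $L$ is continuous with
respect to the weak topology (cf. [13]) and shows that (36) is a linear program. The constraint
$\langle m, R_1 \rangle_{\mathbb{K}} \geq \eta_{1T}$ can classically be reduced to an equality constraint using a slack variable
$a \in \mathbb{R}^+$ such that $\langle m, R_1 \rangle_{\mathbb{K}} + a = \eta_{1T}$. We then extend $\langle \cdot, \cdot \rangle_{\mathbb{K}}$ to $(\mathcal{M}(\mathbb{K}) \times \mathbb{R}, \mathcal{F}(\mathbb{K}) \times \mathbb{R})$ as
follows:

$$\langle (m, x), (v, y) \rangle = \langle m, v \rangle_{\mathbb{K}} + xy. \tag{39}$$

We then obtain the following equality-constrained linear program:

$$\begin{aligned} \eta_2^* = \sup\ & \langle (m, a), (R_2, 0) \rangle = \int R_2\, dm + a \cdot 0, \\ \text{s.t.}\ & L m(B) = 0, \quad \forall B \in \mathcal{B}(\mathbb{S}), \\ & \langle m, 1 \rangle_{\mathbb{K}} = 1, \\ & \langle m, R_1 \rangle_{\mathbb{K}} + a = \eta_{1T}, \\ & m \in \mathcal{M}^+(\mathbb{K}),\ a \in \mathbb{R}^+. \end{aligned} \tag{40}$$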

Since the linear program (40) is equivalent to the initial optimization problem (28), it is
consistent and solvable. The dual linear program of (40) can be written as follows:

$$\begin{aligned} \eta^* = \inf\ & \eta - \lambda \eta_{1T}, \\ \text{s.t.}\ & \eta + u(s) \geq R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u(y)\, Q(dy|s,a), \\ & u \in \mathcal{F}(\mathbb{S}),\ \eta \in \mathbb{R},\ \lambda \in \mathbb{R}^+. \end{aligned} \tag{41}$$

We can already note that the dual program (41) is consistent since $(u = 0, \eta = r_{2M}, \lambda = 0)$
is admissible.



B. Absence of duality gap and strong duality

The two linear programs given in (40) and (41) are both consistent and (40) is solvable.
We then have that there exists a randomized stationary policy $\varphi^* \in \Pi_{RS}$ such that $\eta_2(\varphi^*) = \eta_2^*$.
Also, since the linear program (41) is the dual of (40), the consistencies imply
weak duality, that is $\eta_2^* \leq \eta^*$. For the rest of this section, we will then suppose that $\eta_{1T} < \eta_{10}$.
This condition implies the consistency of the primal optimization problem (40).

We will prove that there is no duality gap between the primal linear program (40) and
its dual (41), that is $\eta_2^* = \eta^*$. We further prove that (41) is solvable, that is, there exists a triplet
$(\eta, \lambda, u)$ admissible for the linear program (41) that attains the optimal value $\eta^*$.

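Theorem 4 (Absence of duality gap). There is no duality gap, that is $\eta_2^* = \eta^*$.

Proof: For ease of notation, we introduce the linear map $\mathcal{L}(m,a) = (L m, \langle m, R_1 \rangle_{\mathbb{K}} + a, \langle m, 1 \rangle_{\mathbb{K}})$.
The proof consists in showing that the set

$$H = \{ (\mathcal{L}(m,a), \langle (m,a), (R_2, 0) \rangle + r) : m \in \mathcal{M}(\mathbb{K})^+,\ a \in \mathbb{R}^+,\ r \in \mathbb{R}^+ \}$$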



is closed (cf. [El). ■ 
We now show that there exists a solution $(u^*, \eta^*, \lambda^*)$ to the linear program (41).

Theorem 5 (Strong duality). The optimization problem (41) is solvable: there exists an optimal triplet
$(u^*, \eta^*, \lambda^*)$ feasible for (41).



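Proof: First, remark that we have

$$\eta^* = \inf_{\lambda \in \mathbb{R}^+} \eta_\lambda - \lambda \eta_{1T}, \tag{42}$$

where, for a given $\lambda \geq 0$, $\eta_\lambda$ is given by the following equation:

$$\begin{aligned} \eta_\lambda = \inf_{u \in \mathcal{F}(\mathbb{S}),\ \eta \in \mathbb{R}}\ & \eta, \\ \text{s.t.}\ & \eta + u(s) \geq R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u(y)\, Q(dy|s,a). \end{aligned} \tag{43}$$

We will then proceed in two steps: (a) we will show that, for every $\lambda \in \mathbb{R}^+$, there exists a solution $(\eta, u)$
of (43); (b) we will show that there exists a solution $\lambda$ of (42).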

As it has been shown in [13], solving (43) is equivalent to solving an unconstrained MDP
with one-step reward function $R_2(s,a) + \lambda R_1(s,a)$, keeping $\mathbb{S}$, $\mathbb{A}$ and $Q$ unchanged. Indeed,
consider the following unconstrained MDP:

$$\eta'_\lambda = \sup_{\varphi} \eta_\lambda(\varphi), \tag{44}$$

where $\eta_\lambda(\varphi)$ is given by the expression $\eta_\lambda(\varphi) = \eta_2(\varphi) + \lambda \eta_1(\varphi)$. Using the same
argument as before, we remark that the optimization problem (43) is the dual of the optimization
problem (44). Since both problems are consistent, weak duality gives the inequality
$\eta'_\lambda \leq \eta_\lambda$. Using the ergodicity condition given in Lemma 1, it is shown in [21] that there exist
a constant $\eta$ and a bounded function $u$ such that

$$\eta + u(s) = \sup_{a \in \mathbb{A}} \left\{ R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u(y)\, Q(dy|s,a) \right\}, \quad \forall s \in \mathbb{S}. \tag{45}$$

It is also shown there that if the couple $(\eta, u)$ satisfies equation (45), then $\eta = \eta'_\lambda$. Remarking
that the couple $(\eta, u)$ defined by equation (45) is admissible for the optimization problem (43)
leads to the inequality $\eta'_\lambda \geq \eta_\lambda$. This justifies that $(\eta_\lambda, u_\lambda) = (\eta, u)$.

We will now show that there exists $\lambda \geq 0$ optimal for (42). For a given $\lambda \geq 0$, let $\zeta_\lambda$ be
such that, for every $s \in \mathbb{S}$, $\zeta_\lambda(s)$ is an argument of the maximum of problem (43). The existence
of such $\zeta_\lambda$ is guaranteed by the compactness of $\mathbb{A}$, by the continuity of $R_1$ and $R_2$ and by the
strong continuity of $Q$ (see Appendix A-C). We denote by $\eta_1^\lambda = \eta_1(\zeta_\lambda)$ and by $\eta_2^\lambda = \eta_2(\zeta_\lambda)$
the primary and secondary throughputs. With trivial adaptations of the results of [13], we show
that $\eta_1^\lambda$ and $\eta_\lambda$ are increasing functions of $\lambda$ and that $\eta_2^\lambda$ is a decreasing function of $\lambda$. We can
also show that the function $\eta_\lambda$ is absolutely continuous. By absolute continuity of $\eta_\lambda$, its right
derivative must coincide with the ordinary derivative, that is

$$\frac{d\eta_\lambda}{d\lambda} = \left(\frac{d\eta_\lambda}{d\lambda}\right)^{+} = \eta_1^\lambda.$$

Let the function $w(\lambda)$ be defined as $w(\lambda) = \eta_\lambda - \lambda \eta_{1T}$. The expression of $w(\lambda)$ is
not known, thus it cannot be used directly to find the optimal value for $\lambda$. Also, $w(\lambda)$ is not
differentiable for every $\lambda$. However, $w(\lambda)$ is differentiable almost everywhere and we have

$$\frac{dw(\lambda)}{d\lambda} = \eta_1^\lambda - \eta_{1T}.$$

Since we have shown that $\eta_1^\lambda$ is an increasing (not necessarily continuous) function, $w(\lambda)$ is a
convex function and then possesses a unique minimum. If $\eta_{1T} < \eta_{10}$, there exists
$\lambda^*$ such that $\eta_1^\lambda - \eta_{1T} \leq 0$ if $\lambda < \lambda^*$ and $\eta_1^\lambda - \eta_{1T} \geq 0$ if $\lambda > \lambda^*$. This proves that
$\lambda^*$ is such that $\eta^* = \eta_{\lambda^*} - \lambda^* \eta_{1T}$. ■
So far, we have proven that there is no duality gap between the linear program (40)
and its dual given in equation (41). We have also shown that there exists a triplet $(\eta, \lambda, u)$
which is feasible and optimal for the optimization problem (41). This triplet can be found with the structure
$(\eta_{\lambda^*}, \lambda^*, u_{\lambda^*})$.

C. A dual based algorithm for finding an optimal policy

In this section, we propose an algorithm based on the dual program (41) for finding an
optimal solution of the linear program (40). As before, we suppose that $\eta_{1T} < \eta_{10}$; this condition
implies the consistency of the problem (16). We have shown that there exists an optimal policy for the
optimization problem (16). Let $\varphi^*$ be an optimal policy for the optimization problem (16).

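For every $\lambda$, we determine $(\eta_\lambda, u_\lambda)$ such that $(\eta_\lambda, \lambda, u_\lambda)$ is feasible for the dual program
(41) using the Relative Value Iteration (RVI) algorithm [14]. The RVI algorithm is given as
follows, where $s_0 \in \mathbb{S}$ is a fixed reference state.

Algorithm 1 Relative Value Iteration (RVI) Algorithm

1: $u^{(0)}(s) \leftarrow 0$, $\forall s \in \mathbb{S}$
2: for $k = 1$ to $\infty$ do
3:   $\forall s \in \mathbb{S}$, $\tilde{u}^{(k)}(s) \leftarrow \max_{a \in \mathbb{A}} \left\{ R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u^{(k-1)}(s')\, Q(ds'|s,a) \right\}$
4:   $\forall s \in \mathbb{S}$, $u^{(k)}(s) \leftarrow \tilde{u}^{(k)}(s) - \tilde{u}^{(k)}(s_0)$
5: end for
6: $u_\lambda = \lim_{k \to \infty} u^{(k)}$
7: $\forall s \in \mathbb{S}$, $\zeta_\lambda(s) \leftarrow \arg\max_{a \in \mathbb{A}} \left\{ R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u_\lambda(s')\, Q(ds'|s,a) \right\}$
8: $\eta_\lambda = \max_{a \in \mathbb{A}} \left\{ R_2(s_0,a) + \lambda R_1(s_0,a) + \int_{\mathbb{S}} u_\lambda(s')\, Q(ds'|s_0,a) \right\}$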



[21] shows that, under the ergodicity results of Lemma 1, this algorithm converges.
The compactness of $\mathbb{A}$ and the continuity of the function $U_\lambda(s,a) = R_2(s,a) + \lambda R_1(s,a) + \int_{\mathbb{S}} u_\lambda(s')\, Q(ds'|s,a)$
ensure that, for every $s \in \mathbb{S}$, $K_\lambda(s) = \arg\max_{a \in \mathbb{A}} U_\lambda(s,a)$ is a non-empty
compact set. We can then build $\zeta_\lambda$ for every $\lambda \geq 0$. This allows us to compute $\eta_1^\lambda$ for every $\lambda$. We
then find $\lambda^*$ with a standard dichotomy (bisection) algorithm.
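The two steps described above (RVI for a fixed $\lambda$, then a dichotomy on $\lambda$) can be sketched as follows for a finite MDP. This is only an illustration under stated assumptions: the state and action sets are finite, the kernel is stored as an array `Q[s, a, s']`, and the function names, the stopping tolerance and the search interval `[0, lam_hi]` are hypothetical choices that do not come from the paper.

```python
import numpy as np

def relative_value_iteration(R, Q, s0=0, tol=1e-10, max_iter=10_000):
    """RVI for a finite average-reward MDP.

    R: (S, A) one-step rewards, Q: (S, A, S) transition probabilities.
    Returns (eta, u, policy) with u[s0] = 0.
    """
    u = np.zeros(R.shape[0])
    for _ in range(max_iter):
        U = R + Q @ u                       # U[s, a] = R[s, a] + sum_s' Q[s, a, s'] u[s']
        u_tilde = U.max(axis=1)
        u_new = u_tilde - u_tilde[s0]       # subtract the value at the reference state
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    U = R + Q @ u
    return U[s0].max(), u, U.argmax(axis=1)

def average_reward(R, Q, policy):
    """Long-term average of R under a deterministic stationary policy."""
    S = R.shape[0]
    P = Q[np.arange(S), policy]             # (S, S) transition matrix of the policy
    r = R[np.arange(S), policy]
    # Stationary distribution: pi P = pi and sum(pi) = 1, solved in the least-squares sense.
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    b = np.append(np.zeros(S), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(pi @ r)

def bisect_lambda(R1, R2, Q, eta_1T, lam_hi=10.0, iters=50):
    """Dichotomy on lambda: smallest lambda whose greedy policy meets the primary constraint."""
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        _, _, policy = relative_value_iteration(R2 + lam * R1, Q)
        if average_reward(R1, Q, policy) < eta_1T:
            lo = lam                        # primary throughput too low: increase lambda
        else:
            hi = lam
    return hi
```

The dichotomy relies on the monotonicity of $\eta_1^\lambda$ in $\lambda$ stated in the proof of Theorem 5; the upper bound `lam_hi` must be chosen large enough for the constraint to be met.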

We will now use the following theorem of [13] to find an optimal policy for the optimization
problem (16).

Theorem 6. Let $\varphi \in \Pi_{RS}$. If $\eta_{\lambda^*}(\varphi) = \eta_{\lambda^*}$ and $\eta_1(\varphi) = \eta_{1T}$, then $\varphi$ is optimal for the optimization
problem (16).

Obviously in our case, if $\eta_1(\zeta_{\lambda^*}) = \eta_{1T}$ then $\zeta_{\lambda^*}$ is optimal. Suppose now that $\eta_1(\zeta_{\lambda^*}) \neq \eta_{1T}$.
We will construct a policy verifying the conditions of Theorem 6. Since (28) is solvable,
there exists an optimal policy denoted by $\varphi^*$. Furthermore, $\varphi^*$ verifies that, for every
$s \in \mathbb{S}$,

$$\begin{aligned} \eta_{\lambda^*} + u_{\lambda^*}(s) &= R_2(s,\varphi^*) + \lambda^* R_1(s,\varphi^*) + \int_{\mathbb{S}} u_{\lambda^*}(s')\, Q(ds'|s,\varphi^*) \\ &= \max_{a \in \mathbb{A}} \left\{ R_2(s,a) + \lambda^* R_1(s,a) + \int_{\mathbb{S}} u_{\lambda^*}(s')\, Q(ds'|s,a) \right\}. \end{aligned}$$

This implies that, for every $s \in \mathbb{S}$, $\varphi^*(K_{\lambda^*}(s)|s) = 1$. It means that, for every $s \in \mathbb{S}$, the policy $\varphi^*$ takes its
actions only among $K_{\lambda^*}(s)$. Consider $\mathbb{A}$ with the order defined by $\succeq$ in Lemma 3. Since $K_{\lambda^*}(s)$
is compact, it possesses a maximal element for the relation $\succeq$, that is, there exists $\bar{a}(s)$ such that,
for all $a \in K_{\lambda^*}(s)$, $\bar{a}(s) \succeq a$. Similarly, we can define $\underline{a}(s)$ as the minimal element of $K_{\lambda^*}(s)$.
For all $s \in \mathbb{S}$, take $\zeta^+(s) = \bar{a}(s)$ and $\zeta^-(s) = \underline{a}(s)$. By construction of $\zeta^+$ and $\zeta^-$ and due to
Lemma 3, we have that $\eta_1(\zeta^-) \leq \eta_1(\varphi^*) \leq \eta_1(\zeta^+)$.

Also, we can easily verify that $\eta_{\lambda^*} = \eta_{\lambda^*}(\zeta^+) = \eta_{\lambda^*}(\zeta^-)$. Using now the same continuity
arguments as in [13], the function $\beta \mapsto \eta_1(\beta\zeta^+ + (1 - \beta)\zeta^-)$ is a continuous function of $\beta$.
Since for $\beta = 0$ we obtain $\eta_1(\zeta^-) \leq \eta_{1T}$ and for $\beta = 1$ we obtain $\eta_1(\zeta^+) \geq \eta_{1T}$, there exists $\beta^*$
such that $\eta_1(\beta^*\zeta^+ + (1 - \beta^*)\zeta^-) = \eta_{1T}$. The policy $\beta^*\zeta^+ + (1 - \beta^*)\zeta^-$ verifies the assumptions
of Theorem 6; it is then optimal for (28).
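The search for the mixing weight $\beta^*$ can be illustrated on a finite model. The sketch below assumes that the two deterministic policies $\zeta^+$ and $\zeta^-$ are already summarized by their transition matrices and primary reward vectors (the names `P_plus`, `P_minus`, `r_plus`, `r_minus` are hypothetical), that the mixed policy picks $\zeta^+(s)$ with probability $\beta$ in every state, and that the mixed chain has a unique stationary distribution. It only uses the continuity of $\beta \mapsto \eta_1$ and the sign change between $\beta = 0$ and $\beta = 1$.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 (least squares, chain assumed unichain)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def eta1_of_mixture(beta, P_plus, P_minus, r_plus, r_minus):
    """Primary throughput of the policy beta * zeta_plus + (1 - beta) * zeta_minus."""
    P = beta * P_plus + (1.0 - beta) * P_minus
    r = beta * r_plus + (1.0 - beta) * r_minus
    return float(stationary_distribution(P) @ r)

def find_beta(P_plus, P_minus, r_plus, r_minus, eta_1T, iters=60):
    """Bisection on beta, assuming eta1(0) <= eta_1T <= eta1(1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if eta1_of_mixture(mid, P_plus, P_minus, r_plus, r_minus) < eta_1T:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy two-state example with made-up numbers.
P_plus = np.array([[0.9, 0.1], [0.5, 0.5]])
P_minus = np.array([[0.5, 0.5], [0.1, 0.9]])
r_plus = np.array([1.0, 0.8])
r_minus = np.array([0.4, 0.2])
beta_star = find_beta(P_plus, P_minus, r_plus, r_minus, eta_1T=0.6)
print(beta_star, eta1_of_mixture(beta_star, P_plus, P_minus, r_plus, r_minus))
```

Note that the throughput of the mixture is in general not linear in $\beta$, because the stationary distribution also depends on $\beta$; this is why a root search is used rather than a closed-form interpolation.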

In this section, we have shown that it is possible to find an optimal policy as a mix between two
deterministic stationary policies. Those two policies are computed with a dynamic programming
algorithm called Relative Value Iteration.

VI. Simulation Results and discussion 

A. Simulation Results 

In this section, we present some simulation results. We consider that the PUs are using an
IR-HARQ protocol with a maximum of 4 retransmissions, so that $N_1 = 5$. The equivalent rate
of the first block is $r_1 = 5$ bpcu. $Tx_1$ is communicating with a normalized power $p_1 = 3.16$
(without unit), which corresponds to 5 dB. This leads to the state space $\mathbb{S} = \{0, 1, \dots, 5\} \times [0, 5]$.

The SUs can use normalized powers within the set $[0, 10]$ (without unit). The set of available
rates is $[0, 4]$ (bpcu). This leads to an action space $\mathbb{A} = [0, 10] \times [0, 4]$.

The parameters of the exponential fading are taken as follows: $\lambda_{11} = \lambda_{21} = \lambda_{12} = \lambda_{22} = 1$.
This completely defines $W$.

In general, no closed-form expression can be given for the equations of the RVI algorithm
(Algorithm 1). We will then approximate the continuous solution by quantizing the state space
$\mathbb{S}$ using 100 linearly spaced values of $i_1$,

$$\{0, 1, \dots, 5\} \times \{0, \epsilon_i, 2\epsilon_i, \dots, 99\epsilon_i\},$$

where $\epsilon_i$ is chosen such that $5/\epsilon_i = 99$. Similarly, we quantize the space of $p_2$ as follows: $\{0, \epsilon_p, 2\epsilon_p, \dots, 99\epsilon_p\}$.
One can remark that the optimizations of Algorithm 1 can be done first on $r_2$
and then on $p_2$. The action space $\mathbb{A}$ is then quantized using the following set:

$$\{(0,0), (\epsilon_p, r_2^*(\epsilon_p)), (2\epsilon_p, r_2^*(2\epsilon_p)), \dots, (99\epsilon_p, r_2^*(99\epsilon_p))\},$$

where $r_2^*(p_2) = \max_{r_2} R_2(p_2, r_2)$. The function $r_2^*(p_2)$ is given in Figure 4 for $p_2 \in [0, 10]$.
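The quantized grids described above can be built as in the following sketch. It assumes the numerical values of this section ($N_1 = 5$, $i_1 \in [0, 5]$, $p_2 \in [0, 10]$, 100 points per axis); the callable `best_rate`, which stands for $r_2^*(p_2)$, is a hypothetical placeholder since its expression depends on the reward $R_2$ and is not reproduced here.

```python
import numpy as np

N1 = 5            # maximum number of IR-HARQ blocks
I_MAX = 5.0       # upper bound of the accumulated mutual information (bpcu)
P_MAX = 10.0      # upper bound of the normalized secondary power
N_POINTS = 100    # number of quantization points per continuous axis

i_grid = np.linspace(0.0, I_MAX, N_POINTS)   # step eps_i = 5 / 99
p_grid = np.linspace(0.0, P_MAX, N_POINTS)   # step eps_p = 10 / 99

# Quantized state space: {0, ..., N1} x {0, eps_i, ..., 99 eps_i}.
states = [(k, i) for k in range(N1 + 1) for i in i_grid]

def quantize_actions(p_grid, best_rate):
    """Quantized action set {(0, 0), (eps_p, r2*(eps_p)), ..., (99 eps_p, r2*(99 eps_p))}.

    best_rate is a user-supplied callable standing for r2*(p2) (hypothetical here).
    """
    return [(0.0, 0.0)] + [(float(p), float(best_rate(p))) for p in p_grid[1:]]

# Illustration with a made-up rate rule; the paper obtains r2* from R2 instead.
actions = quantize_actions(p_grid, best_rate=lambda p: min(4.0, p))
print(len(states), len(actions))
```

With these grids, the integral in Algorithm 1 becomes a finite sum over the quantized states.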




Figure 4. $r_2^*$ versus $p_2$ (horizontal axis: $p_2$, without unit).

We then use the quantized versions of $\mathbb{S}$ and $\mathbb{A}$ to replace the integral by a sum in Algorithm 1. We give the result
obtained for $\lambda = 0.4755$ in Figure 5. To summarize the results obtained for different values of $\eta_{1T}$,
we introduce the throughput region. The throughput region is the set of all couples $(\eta_1, \eta_2)$
achievable while using an allocation. In Figure 6, we give the throughput region corresponding
to the allocations computed using Algorithm 1.



Figure 5. Example of a power allocation for the fourth block of the IR-HARQ protocol (horizontal axis: $i_1$ for block 4, in bpcu). This result has been obtained for
$\lambda = 0.4755$.

Figure 6. Throughput region of the proposed power and rate allocation (horizontal axis: $\eta_1$, in bpcu).

B. Discussion

• It is important to remark that quantizing $\mathbb{S}$ and $\mathbb{A}$ can lead to bad approximations of the
continuous solution. However, it can be shown that, due to the Lipschitz continuity of the reward
functions $R_1$ and $R_2$, to the Lipschitz continuity of the transition kernel $Q$ and to the
ergodicity results, finer quantizations of $\mathbb{S}$ and $\mathbb{A}$ lead to better approximations of $\eta_\lambda$ and $\zeta_\lambda$. Since
the strategies computed with the quantized versions of $\mathbb{S}$ and $\mathbb{A}$ are admissible for the original
problem, the corresponding throughput region is an inner bound of the "true" throughput region.
Effects of the quantization have been studied in [21] or [25].

• One can also note that $Tx_2$ must know $s = (k_1, i_1)$ in order to compute the proposed
power and rate allocation. The problem we consider is, in that sense, completely observable
(CO-MDP). Note that $k_1$ can be tracked easily by "counting" the feedbacks of the PUs, whereas
the knowledge of $i_1$ can be more difficult to acquire. One can then use a partially observable
model (PO-MDP). PO-MDPs are presented in G3 or J2I). It is well known that PO-MDPs can
be difficult to solve; however, some approximations are proposed in [26].

VII. Conclusion 

In this paper, we proposed a secondary rate and power allocation for the case where the SUs are
aware of the parameters describing the PUs' IR-HARQ protocol. We have shown that the ACMI
model used to study the IR-HARQ protocol imposes a CMDP formulation for the allocation
problem. Based on this CMDP formulation, we have also given conditions for the allocation
problem to be consistent. We have shown that, under the condition of consistency, an optimal
allocation exists among the class of the randomized stationary policies. Taking into account
the structure of the optimal policy, we have shown that it is a solution of an infinite-dimensional
linear program. We then solved this linear program via its dual, which is
solved using the RVI algorithm. Finally, we presented and discussed some simulation results.

The quantization of the state and action spaces increases the complexity of
the RVI algorithm. However, lower-complexity algorithms can be used to learn suboptimal policies. Future
work will then be dedicated to the study of approximations leading to these learning algorithms.

Appendix A 
Properties of the CMDP 

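A. Expression of $Q(B|s,a)$, $B \in \mathcal{B}(\mathbb{S})$, $s \in \mathbb{S}$ and $a \in \mathbb{A}$

Fix $n \geq 0$, $(s,a) \in \mathbb{K}$ and $B \in \mathcal{B}(\mathbb{S})$. The definition of $Q(B|s,a)$ is given by equation (10),
which we rewrite for the sake of clarity:

$$Q(B|s,a) = P(s^{n+1} \in B \mid s^n = s, a^n = a),$$

where $s^{n+1} = (k_1^{n+1}, i_1^{n+1})$, $s^n = (k_1^n, i_1^n)$ and $a^n = (p_2^n, r_2^n)$. Keeping these notations and
adding the disturbance $w^n = (\alpha_{11}^n, \alpha_{12}^n, \alpha_{21}^n, \alpha_{22}^n)$, $s^n$ moves to $s^{n+1}$ according to the following
deterministic function:

$$g(s^n, a^n, w^n) = \begin{cases} (0, 0), & \text{if } i_1^n + C\!\left(\frac{\alpha_{11}^n p_1}{1 + \alpha_{21}^n p_2^n}\right) \geq r_1, \\ (N_1, 0), & \text{if } i_1^n + C\!\left(\frac{\alpha_{11}^n p_1}{1 + \alpha_{21}^n p_2^n}\right) < r_1 \text{ and } k_1^n = N_1 - 1, \\ \left(1,\ C\!\left(\frac{\alpha_{11}^n p_1}{1 + \alpha_{21}^n p_2^n}\right)\right), & \text{if } i_1^n + C\!\left(\frac{\alpha_{11}^n p_1}{1 + \alpha_{21}^n p_2^n}\right) < r_1 \text{ and } k_1^n = N_1, \\ \left(k_1^n + 1,\ i_1^n + C\!\left(\frac{\alpha_{11}^n p_1}{1 + \alpha_{21}^n p_2^n}\right)\right), & \text{otherwise.} \end{cases}$$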

Equation (10) can be rewritten using the function $g$ as

$$Q(B|s,a) = P(g(s^n, a^n, w^n) \in B \mid s^n = s, a^n = a) = \int \mathbb{1}_B(g(s,a,w))\, f_W(w)\, dw, \tag{46}$$

where $f_W(w)\, dw = \prod_{i,j \in \{1,2\}} f_{\alpha_{ij}}(\alpha_{ij})\, d\alpha_{ij}$. Due to equation (46), from a state of the form $s = (k_1^n, i_1^n)$
the only accessible states are $(0,0)$ (it corresponds to a successful decoding) or $(k_1^n + 1, i_1^{n+1})$
(corresponding to a decoding failure).

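$$Q(B|s,a) = \int_{\mathbb{S}} \mathbb{1}_B(s')\, Q(ds'|s,a),$$

where $Q(ds'|s,a)$ is expressed as

$$Q(ds'|s,a) = p_0(i_1,p_2)\, \delta_0(dk')\, \delta_0(di_1') + \delta_{k_1+1}(dk')\, \mathbb{1}_{[0,r_1)}(i_1')\, f_{i_1'}(i_1'|i_1,p_2)\, di_1', \tag{47}$$

where $p_0(i_1,p_2) = P\left(i_1 + \log_2\left(1 + \frac{\alpha_{11} p_1}{1 + \alpha_{21} p_2}\right) \geq r_1 \,\middle|\, i_1, p_2\right)$ and $f_{i_1'}(i_1'|i_1,p_2)$ is the conditional
probability density function of the random variable $i_1'$ defined as $i_1' = i_1 + C\!\left(\frac{\alpha_{11} p_1}{1 + \alpha_{21} p_2}\right)$.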
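To make the success probability $p_0(i_1, p_2)$ concrete, the sketch below compares a closed-form evaluation with a Monte Carlo estimate. It rests on assumptions that are not spelled out in this appendix: the gains $\alpha_{11}$ and $\alpha_{21}$ are taken independent and exponentially distributed with rate parameters $\lambda_{11}$ and $\lambda_{21}$, and $C(x) = \log_2(1 + x)$. Under these assumptions, with $t = 2^{r_1 - i_1} - 1$, conditioning on $\alpha_{21}$ gives $p_0 = e^{-\lambda_{11} t / p_1}\, \lambda_{21} / (\lambda_{21} + \lambda_{11} t p_2 / p_1)$. The function names are illustrative.

```python
import numpy as np

def p0_closed_form(i1, p2, p1=3.16, r1=5.0, lam11=1.0, lam21=1.0):
    """P(i1 + log2(1 + a11*p1 / (1 + a21*p2)) >= r1) for independent exponential gains."""
    t = max(2.0 ** (r1 - i1) - 1.0, 0.0)      # SINR threshold needed to reach r1
    return float(np.exp(-lam11 * t / p1) * lam21 / (lam21 + lam11 * t * p2 / p1))

def p0_monte_carlo(i1, p2, p1=3.16, r1=5.0, lam11=1.0, lam21=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of the same probability (assumed fading model)."""
    rng = np.random.default_rng(seed)
    a11 = rng.exponential(1.0 / lam11, n)     # NumPy uses the scale 1/rate
    a21 = rng.exponential(1.0 / lam21, n)
    acmi = i1 + np.log2(1.0 + a11 * p1 / (1.0 + a21 * p2))
    return float(np.mean(acmi >= r1))

print(p0_closed_form(3.0, 1.0), p0_monte_carlo(3.0, 1.0))
```

As expected from the interference term in the denominator, the decoding probability decreases when the secondary power $p_2$ increases.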

If we consider now the case where $k_1 = N_1 - 1$, after the same kind of calculation, we obtain

$$Q(ds'|s,a) = p_0(i_1,p_2)\, \delta_0(dk')\, \delta_0(di_1') + (1 - p_0(i_1,p_2))\, \delta_{N_1}(dk')\, \delta_0(di_1'). \tag{48}$$

If we consider now the last case, $k_1 = N_1$, we obtain

$$Q(ds'|s,a) = p_0(0,p_2)\, \delta_0(dk')\, \delta_0(di_1') + \delta_1(dk')\, \mathbb{1}_{[0,r_1)}(i_1')\, f_{i_1'}(i_1'|0,p_2)\, di_1'. \tag{49}$$

Note that we logically find here that the expression of the evolution from a state $(0, 0)$ is the
same as the one from a state $(N_1, 0)$. This is consistent with the fact that those two classes of
states correspond to the start of the transmission of a new packet.

B. Boundedness of the one-step reward functions and of the long-term reward function

By definition, the functions $R_1(s,a)$ and $R_2(s,a)$ are bounded for every $(s,a) \in \mathbb{K}$.
For every initial distribution $\nu_0$ and every policy $\pi \in \Pi$, $\eta_1(\nu_0, \pi)$ and $\eta_2(\nu_0, \pi)$ are then also
bounded.

C. Lipschitz continuity of the one-step reward functions and of the transition kernel 

We will now give some Lipschitz continuity properties. The spaces $\mathbb{S}$ and $\mathbb{A}$ are considered as
subspaces of $\mathbb{R}^2$ endowed with the sup norm $\|\cdot\|_\infty$. We then compare two states $s = (k_1, i_1) \in \mathbb{S}$
and $s' = (k_1', i_1') \in \mathbb{S}$ as follows:

$$\|s - s'\|_\infty = \max(|k_1 - k_1'|, |i_1 - i_1'|). \tag{50}$$

Similarly, we compare two actions $a = (p_2, r_2) \in \mathbb{A}$ and $a' = (p_2', r_2') \in \mathbb{A}$ as follows:

$$\|a - a'\|_\infty = \max(|p_2 - p_2'|, |r_2 - r_2'|). \tag{51}$$

We then consider $\mathbb{K}$ as a subset of $\mathbb{R}^4$ endowed with the sup norm, and the distance between
$k = (s, a) \in \mathbb{K}$ and $k' = (s', a') \in \mathbb{K}$ is given as follows:

$$\|k - k'\|_\infty = \max(\|s - s'\|_\infty, \|a - a'\|_\infty). \tag{52}$$

We finally give a distance between two probability measures $p_1$ and $p_2$ on $\mathbb{S}$ as follows:

$$d(p_1, p_2) = \|p_1 - p_2\|_{TV} \tag{53}$$

$$= 2 \sup_{B \in \mathcal{B}(\mathbb{S})} |p_1(B) - p_2(B)|. \tag{54}$$

$\|\cdot\|_{TV}$ is called the total variation norm. For any finite signed measure $m$ on $\mathbb{S}$, $\|m\|_{TV}$ is defined
as follows:

$$\|m\|_{TV} = \sup_{B \in \mathcal{B}(\mathbb{S})} m(B) - \inf_{B \in \mathcal{B}(\mathbb{S})} m(B). \tag{55}$$
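For intuition, the total variation distance (53)-(54) can be evaluated on a finite set, where it reduces to the sum of absolute differences of the probability masses. The sketch below, which assumes finite distributions stored as NumPy arrays (the function names are illustrative), compares this shortcut with a brute-force evaluation of the supremum over all events.

```python
import numpy as np
from itertools import combinations

def tv_distance(p1, p2):
    """Total variation distance as in (54) for finite distributions: sum_s |p1(s) - p2(s)|."""
    return float(np.abs(np.asarray(p1) - np.asarray(p2)).sum())

def tv_distance_bruteforce(p1, p2):
    """Direct evaluation of 2 * sup_B |p1(B) - p2(B)| over all non-empty events B."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    n = len(p1)
    best = 0.0
    for size in range(1, n + 1):
        for event in combinations(range(n), size):
            idx = list(event)
            best = max(best, abs(p1[idx].sum() - p2[idx].sum()))
    return 2.0 * best

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.2, 0.3, 0.5])
print(tv_distance(p1, p2), tv_distance_bruteforce(p1, p2))
```

The supremum is attained on the event where $p_1$ exceeds $p_2$, which explains why both evaluations coincide.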

Property 1 (Lipschitz continuity properties). The functions $k \mapsto R_1(k)$, $k \mapsto R_2(k)$ and
$k \mapsto Q(\cdot|k)$ are Lipschitz continuous functions of $k$ on $\mathbb{K}$. That is, there exist three positive scalars
$K_1$, $K_2$ and $K_Q$ such that, for all $k = (s, a) \in \mathbb{K}$ and $k' = (s', a') \in \mathbb{K}$,

$$|R_1(k) - R_1(k')| \leq K_1 \|k - k'\|_\infty, \tag{56}$$

$$|R_2(k) - R_2(k')| \leq K_2 \|k - k'\|_\infty \quad \text{and} \tag{57}$$

$$\|Q(\cdot|k) - Q(\cdot|k')\|_{TV} \leq K_Q \|k - k'\|_\infty. \tag{58}$$


The first two properties follow from a direct application of their definitions. The proof of the third one
is tedious and is therefore omitted. Note that they imply that $R_1$ and $R_2$ are continuous functions.
Property 1 also implies that $Q$ is strongly continuous, that is, for every measurable bounded
function $u$, the function $k \mapsto \int_{\mathbb{S}} u(s)\, Q(ds|k)$ is continuous and bounded on $\mathbb{K}$.



A. On Tightness and relative compactness 

Tightness and relative compactness are two notions related to the study of the convergence of
probability measures. Their definitions are given as follows.

Definition 2 ([23] p. 186, Definition E.5). Let $\mathcal{P}$ be a family of probability measures on $\mathbb{S}$.

1) $\mathcal{P}$ is said to be tight if and only if, for every $\epsilon > 0$, there exists a compact set $\mathbb{S}_1 \subset \mathbb{S}$ such
that $\forall m \in \mathcal{P}$: $m(\mathbb{S}_1) > 1 - \epsilon$.

2) $\mathcal{P}$ is said to be relatively compact if and only if every sequence in $\mathcal{P}$ contains
a convergent subsequence, that is, for every sequence $\{m_n\}$ in $\mathcal{P}$ there is a subsequence
$\{m_{n_i}\}$ and a probability measure $m$ on $\mathbb{S}$ such that $m_{n_i} \Rightarrow m$.

In this definition, the notation $m_n \Rightarrow m$ refers to the weak convergence of the sequence $m_n$
to the measure $m$, that is, for every continuous and bounded function $v$ on $\mathbb{S}$,

$$\int v\, dm_n \to \int v\, dm.$$



A link between tightness and relative compactness is given by Prohorov's theorem, stated
as follows.

Theorem 7 ([23], Theorem E.6 p. 186). Let $\mathcal{P}$ be a family of probability measures on a metric
space $\mathbb{S}$.

1) If $\mathcal{P}$ is tight, then it is relatively compact.

2) Suppose that $\mathbb{S}$ is separable and complete. If $\mathcal{P}$ is relatively compact, then it is tight.

B. Disintegration of measure

We give a disintegration-of-measure result given in [22] or [23].

Proposition 1. For every measure $m$ on $\mathbb{S} \times \mathbb{A}$, there exists a stochastic kernel $\varphi$ on $\mathbb{A}$ given $\mathbb{S}$
such that, for every $B \in \mathcal{B}(\mathbb{S})$ and every $C \in \mathcal{B}(\mathbb{A})$,

$$m(B \times C) = \int_B \varphi(C|s)\, \hat{m}(ds),$$

where $\hat{m}(B) = m(B \times \mathbb{A})$, $\forall B \in \mathcal{B}(\mathbb{S})$.

This disintegration will be denoted $m = \varphi\hat{m}$.